HDB resale price data downloaded from Data.gov.sg, consisting of over 800k resale transactions from 1990 to 2020.
2.Data cleaning and Preprocessing
3.Visual Exploratory Data Analysis
3.1 Median resale price of resale by town over the past 30 years (1990 - 2020)
3.2 Median resale price of resale by town over the past 5 years (2015 - 2020)
3.3 Median resale price of a 4 room flat over the years.
4.Data preparation and feature engineering/selection
5.Training the model and hyperparameter optimization.
5.1 Checking if Linear Regression yields good results.
5.2 Feature Importance
5.3 Using Random Forest with Hyperparameter optimization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
from datetime import datetime as dt
pip install statsmodels
Requirement already satisfied: statsmodels in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (0.12.2) Requirement already satisfied: patsy>=0.5 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from statsmodels) (0.5.1) Requirement already satisfied: pandas>=0.21 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from statsmodels) (1.3.2) Requirement already satisfied: scipy>=1.1 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from statsmodels) (1.7.1) Requirement already satisfied: numpy>=1.15 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from statsmodels) (1.21.2) Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from pandas>=0.21->statsmodels) (2.8.2) Requirement already satisfied: pytz>=2017.3 in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from pandas>=0.21->statsmodels) (2021.1) Requirement already satisfied: six in /opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages (from patsy>=0.5->statsmodels) (1.16.0) Note: you may need to restart the kernel to use updated packages.
files=['resale-flat-prices-based-on-approval-date-1990-1999.csv','resale-flat-prices-based-on-approval-date-2000-feb-2012.csv','resale-flat-prices-based-on-registration-date-from-jan-2015-to-dec-2016.csv','resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv','resale-flat-prices-based-on-registration-date-from-mar-2012-to-dec-2014.csv']
for x in files:
df=pd.read_csv(x)
if x==files[0]:
dataset=df
else:
dataset=pd.concat([dataset,df])
df.head()
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-03 | ANG MO KIO | 2 ROOM | 172 | ANG MO KIO AVE 4 | 06 TO 10 | 45.0 | Improved | 1986 | 250000.0 |
| 1 | 2012-03 | ANG MO KIO | 2 ROOM | 510 | ANG MO KIO AVE 8 | 01 TO 05 | 44.0 | Improved | 1980 | 265000.0 |
| 2 | 2012-03 | ANG MO KIO | 3 ROOM | 610 | ANG MO KIO AVE 4 | 06 TO 10 | 68.0 | New Generation | 1980 | 315000.0 |
| 3 | 2012-03 | ANG MO KIO | 3 ROOM | 474 | ANG MO KIO AVE 10 | 01 TO 05 | 67.0 | New Generation | 1984 | 320000.0 |
| 4 | 2012-03 | ANG MO KIO | 3 ROOM | 604 | ANG MO KIO AVE 5 | 06 TO 10 | 67.0 | New Generation | 1980 | 321000.0 |
dataset.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 826581 entries, 0 to 52202 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 month 826581 non-null object 1 town 826581 non-null object 2 flat_type 826581 non-null object 3 block 826581 non-null object 4 street_name 826581 non-null object 5 storey_range 826581 non-null object 6 floor_area_sqm 826581 non-null float64 7 flat_model 826581 non-null object 8 lease_commence_date 826581 non-null int64 9 resale_price 826581 non-null float64 10 remaining_lease 117527 non-null object dtypes: float64(2), int64(1), object(8) memory usage: 75.7+ MB
#after combining the multiple datasets, sorting by lastest data i.e 2020
dataset['month']=pd.to_datetime(dataset['month'])
dataset.sort_values('month',ascending=False,inplace=True)
dataset.head()
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | remaining_lease | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 80372 | 2020-09-01 | YISHUN | EXECUTIVE | 791 | YISHUN AVE 2 | 04 TO 06 | 146.0 | Maisonette | 1987 | 558000.0 | 66 years 03 months |
| 80367 | 2020-09-01 | YISHUN | 5 ROOM | 637 | YISHUN ST 61 | 10 TO 12 | 121.0 | Improved | 1987 | 459988.0 | 66 years 01 month |
| 80354 | 2020-09-01 | YISHUN | 5 ROOM | 796 | YISHUN RING RD | 01 TO 03 | 121.0 | Improved | 1987 | 455000.0 | 65 years 04 months |
| 80355 | 2020-09-01 | YISHUN | 5 ROOM | 159 | YISHUN ST 11 | 01 TO 03 | 121.0 | Improved | 1985 | 415000.0 | 63 years 07 months |
| 80356 | 2020-09-01 | YISHUN | 5 ROOM | 155 | YISHUN ST 11 | 01 TO 03 | 121.0 | Improved | 1985 | 465000.0 | 63 years 06 months |
dataset.columns
Index(['month', 'town', 'flat_type', 'block', 'street_name', 'storey_range',
'floor_area_sqm', 'flat_model', 'lease_commence_date', 'resale_price',
'remaining_lease'],
dtype='object')
#number of null values
dataset.isnull().sum()
month 0 town 0 flat_type 0 block 0 street_name 0 storey_range 0 floor_area_sqm 0 flat_model 0 lease_commence_date 0 resale_price 0 remaining_lease 709054 dtype: int64
#percent of null values in the remaining lease field
dataset.isnull().mean()
month 0.000000 town 0.000000 flat_type 0.000000 block 0.000000 street_name 0.000000 storey_range 0.000000 floor_area_sqm 0.000000 flat_model 0.000000 lease_commence_date 0.000000 resale_price 0.000000 remaining_lease 0.857816 dtype: float64
#creating new column to calculate remaining lease on our own to eradicate null values.
dataset['new_remaining_lease']=99-(2021-dataset['lease_commence_date'])
dataset.head()
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | remaining_lease | new_remaining_lease | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 80372 | 2020-09-01 | YISHUN | EXECUTIVE | 791 | YISHUN AVE 2 | 04 TO 06 | 146.0 | Maisonette | 1987 | 558000.0 | 66 years 03 months | 65 |
| 80367 | 2020-09-01 | YISHUN | 5 ROOM | 637 | YISHUN ST 61 | 10 TO 12 | 121.0 | Improved | 1987 | 459988.0 | 66 years 01 month | 65 |
| 80354 | 2020-09-01 | YISHUN | 5 ROOM | 796 | YISHUN RING RD | 01 TO 03 | 121.0 | Improved | 1987 | 455000.0 | 65 years 04 months | 65 |
| 80355 | 2020-09-01 | YISHUN | 5 ROOM | 159 | YISHUN ST 11 | 01 TO 03 | 121.0 | Improved | 1985 | 415000.0 | 63 years 07 months | 63 |
| 80356 | 2020-09-01 | YISHUN | 5 ROOM | 155 | YISHUN ST 11 | 01 TO 03 | 121.0 | Improved | 1985 | 465000.0 | 63 years 06 months | 63 |
print(dataset['town'].unique())
print(dataset['storey_range'].unique())
print(dataset['flat_type'].unique())
['YISHUN' 'TOA PAYOH' 'PASIR RIS' 'WOODLANDS' 'TAMPINES' 'SENGKANG' 'SERANGOON' 'CHOA CHU KANG' 'QUEENSTOWN' 'PUNGGOL' 'HOUGANG' 'SEMBAWANG' 'KALLANG/WHAMPOA' 'MARINE PARADE' 'JURONG EAST' 'JURONG WEST' 'GEYLANG' 'CLEMENTI' 'BUKIT MERAH' 'CENTRAL AREA' 'BUKIT BATOK' 'BUKIT PANJANG' 'BUKIT TIMAH' 'BEDOK' 'ANG MO KIO' 'BISHAN' 'LIM CHU KANG'] ['04 TO 06' '10 TO 12' '01 TO 03' '07 TO 09' '13 TO 15' '40 TO 42' '22 TO 24' '31 TO 33' '19 TO 21' '16 TO 18' '25 TO 27' '37 TO 39' '28 TO 30' '34 TO 36' '43 TO 45' '46 TO 48' '49 TO 51' '06 TO 10' '01 TO 05' '11 TO 15' '21 TO 25' '16 TO 20' '36 TO 40' '26 TO 30' '31 TO 35'] ['EXECUTIVE' '5 ROOM' '4 ROOM' '3 ROOM' '2 ROOM' 'MULTI-GENERATION' '1 ROOM' 'MULTI GENERATION']
Data in field flat_type has double data (Multi Generation vs Multi-Generation). There is a need to rectify this as the refer to the same thing rather than unique data
#cleaning flat_type field
dataset.loc[dataset['flat_type']=='MULTI GENERATION','flat_type']='MULTI-GENERATION'
print(dataset['flat_type'].unique())
['EXECUTIVE' '5 ROOM' '4 ROOM' '3 ROOM' '2 ROOM' 'MULTI-GENERATION' '1 ROOM']
# Renaming of flat model duplicates
replace_values = {'NEW GENERATION':'New Generation', 'SIMPLIFIED':'Simplified', 'STANDARD':'Standard', 'MODEL A-MAISONETTE':'Maisonette', 'MULTI GENERATION':'Multi Generation', 'IMPROVED-MAISONETTE':'Executive Maisonette', 'Improved-Maisonette':'Executive Maisonette', 'Premium Maisonette':'Executive Maisonette', '2-ROOM':'2-room', 'MODEL A':'Model A', 'MAISONETTE':'Maisonette', 'Model A-Maisonette':'Maisonette', 'IMPROVED':'Improved', 'TERRACE':'Terrace', 'PREMIUM APARTMENT':'Premium Apartment', 'Premium Apartment Loft':'Premium Apartment', 'APARTMENT':'Apartment', 'Type S1':'Type S1S2', 'Type S2':'Type S1S2'}
dataset = dataset.replace({'flat_model': replace_values})
dataset['flat_model'].value_counts()
Model A 228389 Improved 217356 New Generation 177570 Simplified 53960 Standard 39854 Premium Apartment 35066 Apartment 32004 Maisonette 28798 Model A2 9109 DBSS 1609 Adjoined flat 1085 Terrace 642 Multi Generation 502 Type S1S2 401 Executive Maisonette 196 2-room 40 Name: flat_model, dtype: int64
Standard Introduced in 1960s. Can be 1/2/3/4/5-room.Have WC and shower in same room
Improved Introduced in 1966. The 3/4-room having separate WC and shower, they also featured void decks. 5-room Improved were introduced in 1974.
New Generation Introduced in 1975, New Generation flats can be 3-Room (67 / 82 sqm) or 4-Room (92 sqm), featuring toilet for master bedroom, with pedestal type Water Closet, plus store room.
Model A Introduced in 1981: 3-Room (75 sqm), 4-Room (105 sqm), 5-Room (135 sqm), 5-Room Maisonette (139 sqm)
Model A2 Smaller units of Model A. e.g., 4-Room Model A2 (90 sqm)
Multi Generation 3Gen flats designed to meet the needs of multi-generation families.
Maisonette Model A Maisonette — 2 storeys HDB flat
Executive Maisonette A premium version of Model A Maisonettes. These units are no longer being built after being replaced by the Executive Condominium (EC) scheme in 1995
Executive Apartment Introduced in 1983 and replaced 5-Room Model A flats, in addition of the 3-bedroom and separate living/dining found in 5A flats, EA and EM feature an utility/maid room. 80% of Executive units were Maisonettes and 20% were Apartments.
Premium Apartment Introduced in the 1990s, featuring better quality finishes, in ready-to-move condition, with flooring, kitchen cabinets, built-in wardrobes upon purchase
DBSS Also known as the Design Build and Sell Scheme.They are a unique (and premium) breed of HDB flats in Singapore, which are built by private developers. High Prices. Quite similiar to Executive Condominium except DBBS is like a premium HDB without facilities of private condos and remains a HDB flat while EC can be converted to a private property over time.
Adjoined Flat Large HDB flats which are combined from 2 HDB flats
Terrace HDB terrace flats built before HDB Due to lSingapore's land constraint, this is no longer being built and offered for sale
Type S1S2 Apartments like The Pinnacle@Duxton are classified as "S" or Special apartments in view of its historical significance and award-winning design. For application of HDB policies, S1 and S2 apartments will be treated as 4-room and 5-room flats respective
2-room This refers to 2-room flexi where there is 1 bedroom and 1 common area. It can also fall under a 99 year lease scheme. This flats are meant for elderly or those with smaller family sizes.
ft = dataset['flat_type'].value_counts()/len(dataset)*100
ft
4 ROOM 37.420894 3 ROOM 32.976804 5 ROOM 20.616007 EXECUTIVE 7.578326 2 ROOM 1.193228 1 ROOM 0.154008 MULTI-GENERATION 0.060732 Name: flat_type, dtype: float64
df2 = pd.DataFrame(np.array([['4 room', 37.420894], ['3 room', 32.976804],['5 room', 20.616007],['Exec', 7.578326],['2 room', 1.193228],['1 room', 0.154008],['Multi Gen', 0.060732]]),
... columns=['flat type', 'proportion of ownership'])
#covert column from object to integer
df2['proportion of ownership']=pd.to_numeric(df2['proportion of ownership'])
df2.dtypes
flat type object proportion of ownership float64 dtype: object
# Plot the graph based on flat type ownership
fig = px.bar(df2,
x="flat type",
y="proportion of ownership",
color="flat type",
title='HDB flat type ownership over the past 30 years (from 1990 to 2020)')
# Display the graph
fig.show()
#analysing flat type ownership for recent 5 years, has it changed?
dataset['month']=pd.to_datetime(dataset['month'])
ft2 = dataset.set_index('month')['2015':'2019'].reset_index()['flat_type'].value_counts()/len(dataset.set_index('month')['2015':'2019'].reset_index())*100
ft2
4 ROOM 41.210346 3 ROOM 25.254169 5 ROOM 24.308493 EXECUTIVE 7.897721 2 ROOM 1.249396 1 ROOM 0.042403 MULTI-GENERATION 0.037472 Name: flat_type, dtype: float64
df3 = pd.DataFrame(np.array([['4 room', 41.210346], ['3 room', 25.254169],['5 room', 24.308493],['Exec', 7.897721],['2 room', 1.249396],['1 room', 0.042403],['Multi Gen', 0.037472]]),
... columns=['flat type', 'proportion of ownership'])
#covert column from object to integer
df3['proportion of ownership']=pd.to_numeric(df2['proportion of ownership'])
df3.dtypes
flat type object proportion of ownership float64 dtype: object
# Plot the graph based on flat type ownership over past 5 years
fig = px.bar(df3,
x="flat type",
y="proportion of ownership",
color="flat type",
title='HDB flat type ownership over the past 5 years (from 2015 to 2020)')
# Display the graph
fig.show()
dataset['year'] = pd.DatetimeIndex(dataset['month']).year # extract out year
d4 = dataset.groupby(['town','year','flat_type'], as_index=False).agg({'resale_price': 'median'}).sort_values('resale_price', ascending=True).reset_index(drop=True)
d4
d4.sort_values(['town','year','flat_type'],ascending=[True,True,True])
d4.rename(columns={'resale_price':'median_resale_price'})
| town | year | flat_type | median_resale_price | |
|---|---|---|---|---|
| 0 | ANG MO KIO | 1991 | 1 ROOM | 7000.0 |
| 1 | BUKIT MERAH | 1990 | 1 ROOM | 7500.0 |
| 2 | ANG MO KIO | 1990 | 1 ROOM | 8000.0 |
| 3 | ANG MO KIO | 1992 | 1 ROOM | 8250.0 |
| 4 | BUKIT MERAH | 1991 | 1 ROOM | 8950.0 |
| ... | ... | ... | ... | ... |
| 3454 | CENTRAL AREA | 2016 | 5 ROOM | 995000.0 |
| 3455 | CENTRAL AREA | 2019 | 5 ROOM | 1000000.0 |
| 3456 | QUEENSTOWN | 2020 | EXECUTIVE | 1000000.0 |
| 3457 | CENTRAL AREA | 2020 | 5 ROOM | 1040094.0 |
| 3458 | QUEENSTOWN | 2019 | EXECUTIVE | 1065000.0 |
3459 rows × 4 columns
d5=d4[d4['flat_type']=='4 ROOM'].sort_values('year',ascending=True)
towns_of_interest = ['TOA PAYOH','QUEENSTOWN', 'CENTRAL AREA','ANG MO KIO','BUKIT MERAH','BUKIT TIMAH','TAMPINES','BEDOK','HOUGANG','PUNGGOL','GEYLANG','KALLANG/WHAMPOA','CHUA CHU KANG','YIO CHU KANG','JURONG','WOODLANDS','YISHUN','SENGKANG']
import plotly.express as px
# Plot the graph
# isin() will check for the countries within the countries_of_interest list
fig = px.line(d5[d5['town'].isin(towns_of_interest)],
x="year",
y="resale_price",
color='town',
title="Median resale prices of 4 room HDB flats over the years"
)
# Display the graph
fig.show()
In recent times there seems to be a gentler increase in the median resale price, especially so for mature estates such as Toa Payoh and Central Area. This is counter intuitive as the demand for such flats should be higher. This could be due to the fact that most of these flats tend to be older hence have shorter lease. Flats in these 2 areas still have high prices but the rate of increase is much slower
dataset['flat_type'].unique()
array(['EXECUTIVE', '5 ROOM', '4 ROOM', '3 ROOM', '2 ROOM',
'MULTI-GENERATION', '1 ROOM'], dtype=object)
#omitting 1 room flats as no longer relevant
dataset2=dataset[dataset['flat_type']!='1 ROOM']
dataset2[dataset2['flat_type']=='1 ROOM']
| month | town | flat_type | block | street_name | storey_range | floor_area_sqm | flat_model | lease_commence_date | resale_price | remaining_lease | new_remaining_lease | year |
|---|
final_dataset=dataset2[['resale_price', 'town', 'flat_type', 'storey_range',
'floor_area_sqm', 'lease_commence_date', 'new_remaining_lease']]
final_dataset[['flat_type','resale_price']]
final_dataset['flat_type']=final_dataset['flat_type'].astype(str).str[0]
final_dataset[['flat_type','resale_price']]
final_dataset['flat_type'].unique()
/opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
array(['E', '5', '4', '3', '2', 'M'], dtype=object)
#Label encode flat type
ordinal_labels=final_dataset.groupby(['flat_type'])['resale_price'].mean().sort_values()
ordinal_labels
ordinal_labels=final_dataset.groupby(['flat_type'])['resale_price'].mean().sort_values().index
ordinal_labels
Index(['2', '3', '4', '5', 'E', 'M'], dtype='object', name='flat_type')
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels2
{'2': 0, '3': 1, '4': 2, '5': 3, 'E': 4, 'M': 5}
final_dataset['flat_type_ordinal_labels']=final_dataset['flat_type'].map(ordinal_labels2)
final_dataset.head()
/opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| resale_price | town | flat_type | storey_range | floor_area_sqm | lease_commence_date | new_remaining_lease | flat_type_ordinal_labels | |
|---|---|---|---|---|---|---|---|---|
| 80372 | 558000.0 | YISHUN | E | 04 TO 06 | 146.0 | 1987 | 65 | 4 |
| 80367 | 459988.0 | YISHUN | 5 | 10 TO 12 | 121.0 | 1987 | 65 | 3 |
| 80354 | 455000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1987 | 65 | 3 |
| 80355 | 415000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 |
| 80356 | 465000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 |
#Label encode storey_range feature
ordinal_labels_sr=final_dataset.groupby(['storey_range'])['resale_price'].mean().sort_values().index
ordinal_labels_sr
Index(['01 TO 03', '04 TO 06', '07 TO 09', '10 TO 12', '13 TO 15', '16 TO 18',
'19 TO 21', '01 TO 05', '06 TO 10', '22 TO 24', '11 TO 15', '25 TO 27',
'16 TO 20', '21 TO 25', '28 TO 30', '26 TO 30', '31 TO 35', '36 TO 40',
'31 TO 33', '34 TO 36', '37 TO 39', '40 TO 42', '46 TO 48', '49 TO 51',
'43 TO 45'],
dtype='object', name='storey_range')
ordinal_labels3={k:i for i,k in enumerate(ordinal_labels_sr,0)}
ordinal_labels3
{'01 TO 03': 0,
'04 TO 06': 1,
'07 TO 09': 2,
'10 TO 12': 3,
'13 TO 15': 4,
'16 TO 18': 5,
'19 TO 21': 6,
'01 TO 05': 7,
'06 TO 10': 8,
'22 TO 24': 9,
'11 TO 15': 10,
'25 TO 27': 11,
'16 TO 20': 12,
'21 TO 25': 13,
'28 TO 30': 14,
'26 TO 30': 15,
'31 TO 35': 16,
'36 TO 40': 17,
'31 TO 33': 18,
'34 TO 36': 19,
'37 TO 39': 20,
'40 TO 42': 21,
'46 TO 48': 22,
'49 TO 51': 23,
'43 TO 45': 24}
final_dataset['storey_range_ordinal_labels']=final_dataset['storey_range'].map(ordinal_labels3)
final_dataset.head()
/opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| resale_price | town | flat_type | storey_range | floor_area_sqm | lease_commence_date | new_remaining_lease | flat_type_ordinal_labels | storey_range_ordinal_labels | |
|---|---|---|---|---|---|---|---|---|---|
| 80372 | 558000.0 | YISHUN | E | 04 TO 06 | 146.0 | 1987 | 65 | 4 | 1 |
| 80367 | 459988.0 | YISHUN | 5 | 10 TO 12 | 121.0 | 1987 | 65 | 3 | 3 |
| 80354 | 455000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1987 | 65 | 3 | 0 |
| 80355 | 415000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 |
| 80356 | 465000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 |
# there is a need to reduce the number of class of town to regions before we convert to one hot encoding inorder to feed into model
East=['TAMPINES','BEDOK','PASIR RIS']
North_East=['ANG MO KIO','HOUGANG','PUNGGOL','SENGKANG','SERANGOON']
North=['SEMBAWANG','WOODLANDS','YISHUN']
West=['BUKIT BATOK','BUKIT PANJANG','CHOA CHU KANG','CLEMENTI','JURONG EAST','JURONG WEST']
Central=['CENTRAL AREA','BISHAN','BUKIT MERAH','BUKIT TIMAH','GEYLANG','KALLANG/WHAMPOA','MARINE PARADE','QUEENSTOWN','TOA PAYOH']
# create a key value pair dictionary to map
town_regions = {'ANG MO KIO':'North East', 'BEDOK':'East', 'BISHAN':'Central', 'BUKIT BATOK':'West', 'BUKIT MERAH':'Central',
'BUKIT PANJANG':'West', 'BUKIT TIMAH':'Central', 'CENTRAL AREA':'Central', 'CHOA CHU KANG':'West',
'CLEMENTI':'West', 'GEYLANG':'Central', 'HOUGANG':'North East', 'JURONG EAST':'West', 'JURONG WEST':'West',
'KALLANG/WHAMPOA':'Central', 'MARINE PARADE':'Central', 'PASIR RIS':'East', 'PUNGGOL':'North East',
'QUEENSTOWN':'Central', 'SEMBAWANG':'North', 'SENGKANG':'North East', 'SERANGOON':'North East', 'TAMPINES':'East',
'TOA PAYOH':'Central', 'WOODLANDS':'North', 'YISHUN':'North','LIM CHU KANG':'North'}
final_dataset['region'] = final_dataset['town'].map(town_regions)
final_dataset.head()
/opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages/ipykernel_launcher.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| resale_price | town | flat_type | storey_range | floor_area_sqm | lease_commence_date | new_remaining_lease | flat_type_ordinal_labels | storey_range_ordinal_labels | region | |
|---|---|---|---|---|---|---|---|---|---|---|
| 80372 | 558000.0 | YISHUN | E | 04 TO 06 | 146.0 | 1987 | 65 | 4 | 1 | North |
| 80367 | 459988.0 | YISHUN | 5 | 10 TO 12 | 121.0 | 1987 | 65 | 3 | 3 | North |
| 80354 | 455000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1987 | 65 | 3 | 0 | North |
| 80355 | 415000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 | North |
| 80356 | 465000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 | North |
# to encode region
ordinal_labels_re=final_dataset.groupby(['region'])['resale_price'].mean().sort_values().index
ordinal_labels_re
Index(['North', 'West', 'Central', 'East', 'North East'], dtype='object', name='region')
ordinal_labels4={k:i for i,k in enumerate(ordinal_labels_re,0)}
ordinal_labels4
{'North': 0, 'West': 1, 'Central': 2, 'East': 3, 'North East': 4}
final_dataset['region_ordinal_labels']=final_dataset['region'].map(ordinal_labels4)
final_dataset.head()
/opt/anaconda3/envs/hdbprediction/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
| resale_price | town | flat_type | storey_range | floor_area_sqm | lease_commence_date | new_remaining_lease | flat_type_ordinal_labels | storey_range_ordinal_labels | region | region_ordinal_labels | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 80372 | 558000.0 | YISHUN | E | 04 TO 06 | 146.0 | 1987 | 65 | 4 | 1 | North | 0 |
| 80367 | 459988.0 | YISHUN | 5 | 10 TO 12 | 121.0 | 1987 | 65 | 3 | 3 | North | 0 |
| 80354 | 455000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1987 | 65 | 3 | 0 | North | 0 |
| 80355 | 415000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 | North | 0 |
| 80356 | 465000.0 | YISHUN | 5 | 01 TO 03 | 121.0 | 1985 | 63 | 3 | 0 | North | 0 |
final_dataset=final_dataset[['resale_price','lease_commence_date','new_remaining_lease','flat_type_ordinal_labels','storey_range_ordinal_labels','region_ordinal_labels']]
final_dataset.head()
final_dataset.isnull().sum()
resale_price 0 lease_commence_date 0 new_remaining_lease 0 flat_type_ordinal_labels 0 storey_range_ordinal_labels 0 region_ordinal_labels 0 dtype: int64
# use scatter_matrix to see correlation of features
fig = px.scatter_matrix(final_dataset.iloc[:,1:])
fig.show()